
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 #17031

Merged
6 commits merged into microsoft:main on Jan 22, 2024

Conversation

@snadampal
Contributor

commented Aug 7, 2023

Description

This PR adds SbgemmKernel for aarch64. This includes the Sbgemm kernel, which implements matrix multiplication with bfloat16 SIMD instructions (bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To enable the Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.

Motivation and Context

This is to improve MatMul performance on the aarch64 platform.
I have run the benchmarking script below (BERT, RoBERTa, and GPT-2 model inference) on an AWS Graviton3-based c7g.4xl instance and observed a 1.2x to 1.76x performance improvement compared to the sgemm (fp32) kernel.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```

The unit test precision results match the sgemm kernel results. The build command used:

```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync
```

@snadampal
Contributor Author

I would appreciate it if someone could review this PR.

@snadampal
Contributor Author

Hi @snnn, would you be able to review and provide feedback on this PR? I appreciate your time.

@snadampal
Contributor Author

Hi, I have rebased the PR to resolve the merge conflicts. I'm happy to address any feedback you may have. Thank you!

@milpuz01
Contributor

I have checked out the changes and run performance and accuracy tests, with and without the flag, using onnxruntime_perf_test (I modified the binary to dump outputs for comparison) on AWS Graviton3 instances, and the results look fine.

@snadampal force-pushed the sbgemm_aarch64 branch 4 times, most recently from eb257ff to 83a6f6e on October 4, 2023 at 19:29
@snadampal
Contributor Author

Hi @chenfucn, @yufenglee, I have updated the PR (1) to move to the newer gemm interface and (2) to add session-option-based fastmath mode control. Please review and let me know your feedback.

@snadampal
Contributor Author

Hi @chenfucn, @yufenglee, I would appreciate it if someone could trigger the CI for this PR. I have addressed all the feedback except the Windows testing, for which I'm waiting on the Windows CI results. Thank you!

@chenfucn
Contributor

@chenfucn left a comment

As we discussed, please add mlas unit tests that call the kernel directly with different shapes and other parameters.

Review threads on onnxruntime/core/providers/cpu/math/matmul.cc and onnxruntime/test/providers/base_tester.cc (outdated, resolved)
@chenfucn
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline

@chenfucn
Contributor

/azp run ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@snadampal
Contributor Author

Thanks for the review, I will update the PR to address this and also add unit tests.

@snadampal
Contributor Author

snadampal commented Oct 25, 2023

I have updated the PR to address all the feedback so far, and also the learnings from my other qgemm PR:
(1) enabled the feature only on non-Apple platforms
(2) added mlas unit tests
(3) tested the Linux full build (both release and release-with-debug-info)
(4) tested the minimal build
(5) tested the Android build with cross compilation on x86
(6) ran lintrunner and git-clang-format

Next, I will add ort optimizer and provider tests to exercise the fastmath session option.
Please review and let me know if you have any feedback on this version.

@snnn
Member

snnn commented Jan 17, 2024

@snadampal did you push your change to GitHub?

@snadampal
Contributor Author

snadampal commented Jan 17, 2024

Not yet. I'm planning to push the code formatting changes along with the session option name change.

@snadampal
Contributor Author

Hi @skottmckay, I would appreciate your response on this. If you could comment on the naming part, I will update the PR. The following is what I think should cover your suggestion. You had suggested replacing "session" with "mlas". I think "session" makes sense because the config is still session specific, so I left it but added "mlas" to it:

static const char* const kOrtSessionOptionsMlasGemmFastMathMode = "session.enable_mlas_gemm_fastmath_mode";

@skottmckay
Contributor

> Hi @skottmckay, if you could comment on the naming part, I will update the PR. The following is what I think should cover your suggestion. You had suggested replacing "session" with "mlas". I think "session" makes sense because the config is still session specific, so I left it but added "mlas" to it:
>
> static const char* const kOrtSessionOptionsMlasGemmFastMathMode = "session.enable_mlas_gemm_fastmath_mode";

I think I would consider the first name as something that points me to where I would find the setting being used. e.g. 'optimization' means look in the optimizer project. I would say it's inferred you're configuring something in the session as you're using SessionOptions (vs. say RunOptions). Based on that, I would vote for 'mlas.' as the prefix.

The name also seems a little too generic as it sounds like it would apply to MLAS as a whole. Unless we think there will be some other fastpath that applies to MLAS GEMM in general, a more specific name would be clearer. e.g. mlas.enable_gemm_fastmath_arm64_bfloat16

Or alternatively the platform/datatype could be in the value and you could parse that.

e.g. mlas.enable_gemm_fastmath_mode could have a value of arm64.bfloat16, and additional platform.data_type values could be parsed for. That is obviously more complicated, so we should avoid it unless we think it would be used. It would possibly be required if we had this GEMM fastpath on multiple platforms or for multiple data types, though, as I assume I'd want to be able to enable/disable each specific combination, and having a new config key for every single combination doesn't scale well.
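For illustration only, a minimal sketch (not code from this PR) of how such a combined value might be parsed; the helper name and the comma-separated "platform.data_type" value format are assumptions based on the suggestion above:

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: parse a value such as "arm64.bfloat16,x64.float16"
// from a single "mlas.enable_gemm_fastmath_mode" entry and report whether a
// given platform/data-type combination was requested.
bool FastMathEnabledFor(const std::string& config_value,
                        const std::string& platform,
                        const std::string& data_type) {
  std::set<std::string> requested;
  std::stringstream ss(config_value);
  std::string token;
  while (std::getline(ss, token, ',')) {
    requested.insert(token);  // each token is "platform.data_type"
  }
  return requested.count(platform + "." + data_type) > 0;
}

// Example: FastMathEnabledFor("arm64.bfloat16", "arm64", "bfloat16") -> true
```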

@snadampal
Contributor Author

Thank you, I see your point. bf16 and fp16 are the potential fastmath options, but on aarch64 I so far see interest in bf16 fastmath only. I agree that there may not be multiple of these for different platforms, so I will go ahead with a simple config key:

static const char* const kOrtSessionOptionsMlasGemmFastMathArm64Bfloat16 = "mlas.enable_gemm_fastmath_arm64_bfloat16";
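For reference, a minimal sketch (not from the PR itself) of how an application could opt in to this option through the public C++ API; the model path is a placeholder, and this assumes a Linux aarch64 build where the session constructor takes a narrow-character path:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "sbgemm-fastmath");
  Ort::SessionOptions so;

  // Opt in to the bfloat16 fastmath GEMM path; it is off by default.
  so.AddConfigEntry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1");

  // "model.onnx" is a placeholder; MatMul nodes in this session can then
  // route through the Sbgemm kernel when the hardware supports bf16.
  Ort::Session session(env, "model.onnx", so);
  return 0;
}
```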

Added the SbgemmKernel assembly implementation with bfmmla instructions and
sbgemm utility functions to prepack matrix B along with conversion to bfloat16.
The sbgemm kernel is invoked when fastmath mode is enabled and the hardware
supports the bf16 instruction set. It is disabled by default; set the following
session option to 1 to enable it:
"kOrtSessionOptionsMlasGemmFastMathArm64Bfloat16"
@snadampal
Contributor Author

Updated the PR for the session option name and the other points discussed so far, including clang formatting. Tested:

  1. release, debug, and minimal builds on aarch64 Neoverse V1 and N1 platforms
  2. Android build and Linux cross compilation for the aarch64 config on an x86 platform

@chenfucn
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline

@chenfucn
Contributor

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@snnn merged commit 77da2ef into microsoft:main on Jan 22, 2024
55 of 56 checks passed
@snadampal
Contributor Author

Thanks to @chenfucn, @snnn, @skottmckay, and @yufenglee for the great feedback and for merging the PR!

YUNQIUGUO pushed a commit that referenced this pull request on Jan 23, 2024: [aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031)

@snnn
Member

snnn commented Jan 24, 2024

@snadampal, thanks for making ONNX Runtime better. You are welcome to bring more changes to us. You have my email; do not hesitate to contact me anytime you need help with reviewing PRs.
